AITopics | instructional video

Collaborating Authors

instructional video

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Video-Mined Task Graphs for Keystep Recognition in Instructional Videos

Neural Information Processing SystemsFeb-17-2026, 08:35:27 GMT

Instructional "how-to" videos online allow users to master new skills and everyday DIY tasks, from

artificial intelligence, machine learning, natural language, (11 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Asia > Middle East > Israel (0.04)

Genre:

Research Report > Promising Solution (0.68)
Instructional Material > Course Syllabus & Notes (0.43)

Industry:

Education > Educational Technology > Audio & Video (0.53)
Education > Educational Technology > Media (0.43)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

Add feedback

804e757b7d7043c26701c3a313032101-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing SystemsFeb-16-2026, 04:05:30 GMT

arxiv preprint arxiv, large language model, machine learning, (19 more...)

Neural Information Processing Systems

Country:

Asia > Singapore (0.04)
Asia > Japan > Honshū > Chūbu > Toyama Prefecture > Toyama (0.04)

Genre:

Research Report (1.00)
Instructional Material > Course Syllabus & Notes (0.61)

Industry:

Education (0.69)
Information Technology > Software (0.46)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(2 more...)

Add feedback

acaa23f71f963e96c8847585e71352d6-Paper.pdf

Neural Information Processing SystemsFeb-9-2026, 19:30:53 GMT

computer vision, dataset, noun, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Add feedback

Video-Mined Task Graphs for Keystep Recognition in Instructional Videos

Neural Information Processing SystemsDec-26-2025, 21:30:28 GMT

Procedural activity understanding requires perceiving human actions in terms of a broader task, where multiple keysteps are performed in sequence across a long video to reach a final goal state---such as the steps of a recipe or the steps of a DIY fix-it task. Prior work largely treats keystep recognition in isolation of this broader structure, or else rigidly confines keysteps to align with a particular sequential script. We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps, then leverage this graph to regularize keystep recognition in novel videos. On multiple datasets of real-world instructional video, we show the impact: more reliable zero-shot keystep localization and improved video representation learning, exceeding the state of the art.

keystep recognition, name change, video-mined task graph, (2 more...)

Neural Information Processing Systems

Genre: Instructional Material > Course Syllabus & Notes (0.32)

Industry:

Education > Educational Technology > Media (0.68)
Education > Educational Technology > Audio & Video (0.68)

Technology: Information Technology > Artificial Intelligence (0.79)

Add feedback

Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos

Neural Information Processing SystemsDec-24-2025, 08:21:46 GMT

We introduce the task of spatially localizing narrated interactions in videos. Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations. To achieve this goal, we propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training. We introduce a divided strategy that alternates between computing inter-and intra-modal attention across the visual and natural language modalities, which allows effective training via directly contrasting the two modalities' representations. We demonstrate the effectiveness of our approach by self-training on the HowTo100M instructional video dataset and evaluating on a newly collected dataset of localized described interactions in the YouCook2 dataset. We show that our approach outperforms alternative baselines, including shallow co-attention and full cross-modal attention. We also apply our approach to grounding phrases in images with weak supervision on Flickr30K and show that stacking multiple attention layers is effective and, when combined with a word-to-region loss, achieves state of the art on recall-at-one and pointing hand accuracies.

instructional video, name change, self-supervised spatial grounding, (7 more...)

Neural Information Processing Systems

Genre: Instructional Material > Course Syllabus & Notes (0.30)

Industry:

Education > Educational Technology > Media (0.66)
Education > Educational Technology > Audio & Video (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.40)

Add feedback

Exploring Ordinal Bias in Action Recognition for Instructional Videos

Kim, Joochan, Jung, Minjoon, Zhang, Byoung-Tak

arXiv.org Artificial IntelligenceDec-8-2025

Action recognition models have achieved promising results in understanding instructional videos. However, they often rely on dominant, dataset-specific action sequences rather than true video comprehension, a problem that we define as ordinal bias. To address this issue, we propose two effective video manipulation methods: Action Masking, which masks frames of frequently co-occurring actions, and Sequence Shuffling, which randomizes the order of action segments. Through comprehensive experiments, we demonstrate that current models exhibit significant performance drops when confronted with nonstandard action sequences, underscoring their vulnerability to ordinal bias. Our findings emphasize the importance of rethinking evaluation strategies and developing models capable of generalizing beyond fixed action patterns in diverse instructional videos. Due to the dominant action pair'Take-Background', the model fails to predict the action'Open.' Action recognition in instructional videos has witnessed remarkable progress, primarily driven by models that excel in curated benchmark datasets (Farha & Gall, 2019; Ishikawa et al., 2021; Li et al., 2020; Yi et al., 2021).

artificial intelligence, dataset, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2504.0658

Genre:

Research Report (0.84)
Instructional Material > Course Syllabus & Notes (0.66)

Industry:

Education > Educational Technology > Media (1.00)
Education > Educational Technology > Audio & Video (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.35)

Add feedback

DynaStride: Dynamic Stride Windowing with MMCoT for Instructional Multi-Scene Captioning

Pham, Eddison, Priyadarshini, Prisha, Maliackel, Adrian, Bandi, Kanishk, Meo, Cristian, Zhu, Kevin

arXiv.org Artificial IntelligenceDec-2-2025

Scene-level captioning in instructional videos can enhance learning by requiring an understanding of both visual cues and temporal structure. By aligning visual cues with textual guidance, this understanding supports procedural learning and multimodal reasoning, providing a richer context for skill acquisition. However, captions that fail to capture this structure may lack coherence and quality, which can create confusion and undermine the video's educational intent. To address this gap, we introduce DynaStride, a pipeline to generate coherent, scene-level captions without requiring manual scene segmentation. Using the YouCookII dataset's scene annotations, DynaStride performs adaptive frame sampling and multimodal windowing to capture key transitions within each scene. It then employs a multimodal chain-of-thought process to produce multiple action-object pairs, which are refined and fused using a dynamic stride window selection algorithm that adaptively balances temporal context and redundancy. The final scene-level caption integrates visual semantics and temporal reasoning in a single instructional caption. Empirical evaluations against strong baselines, including VLLaMA3 and GPT-4o, demonstrate consistent gains on both N-gram-based metrics (BLEU, METEOR) and semantic similarity measures (BERTScore, CLIPScore). Qualitative analyses further show that DynaStride produces captions that are more temporally coherent and informative, suggesting a promising direction for improving AI-powered instructional content generation.

caption, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2510.23907

Country: North America > United States (0.68)

Genre: Research Report (0.66)

Industry: Education > Educational Technology > Audio & Video (0.36)

Technology: